Homework 3: Wikipedia Clustering
Abstract
Clustering is an important machine learning task that tackles the problem of grouping data into distinct clusters based on their features. An ideal clustering algorithm maximizes feature similarity within a cluster while minimizing feature similarity across clusters. Two of the most common clustering algorithms are spectral clustering and k-means clustering. This project consists of two main parts: extracting features into a bag-of-words representation, and then performing clustering using these features.
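The two parts named in the abstract can be illustrated with a minimal sketch. It is not the project's implementation: the regex tokenizer, the raw-count features, the four toy documents, the hand-picked initial centroids, and the function names `bag_of_words` and `kmeans` are all assumptions made for illustration, and plain k-means (Lloyd's algorithm) stands in for whichever clustering algorithm the project actually used.

```python
from collections import Counter
import re


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of text documents."""
    # Assumed tokenizer: lowercase runs of ASCII letters.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain Lloyd's k-means; returns one cluster label per vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical, not Wikipedia data).
docs = [
    "cats and dogs are pets",
    "dogs and cats chase balls",
    "stocks and bonds are investments",
    "bonds and stocks carry risk",
]
vocab, vectors = bag_of_words(docs)
# Seed the centroids with the first and third documents (an arbitrary choice).
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(labels)
```

On a real Wikipedia dump the feature matrix would be far too large for dense Python lists; a sparse representation and a weighting scheme such as TF-IDF would likely be needed, but those choices are outside what the abstract states.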
Similar resources
CS 294 - 1 Homework 3 Timothy Hunter and Andre
In this assignment, the goal was to parse a large set of Wikipedia articles, to extract features into a sparse feature matrix and to cluster them with a clustering algorithm of our choice. One motivation to perform automated clustering on unstructured, unlabeled data is to detect correlations between data points; for instance, in the case of Wikipedia, one might be able to automatically group a...
Categorization of Wikipedia Articles with Spectral Clustering
The article reports the application of clustering algorithms to create hierarchical groups of Wikipedia articles. We evaluate three spectral clustering algorithms on datasets constructed from Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly.
CSCE 313-200: Computer Systems
The goal of this project is to search for a given set of substrings in English Wikipedia, which exists on beefybox in four versions – tiny (50 MB), small (512 MB), medium (8 GB), and complete (28 GB). While Wikipedia does contain some UTF-8 characters, all target substrings in this homework are US ASCII (i.e., byte values below 128), which means that you will not have to perform any conversion ...
Multilingual Document Clustering Using Wikipedia as External Knowledge
This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia, a structured multilingual knowledge base, has been highly exploited in many monolingual clustering approaches and also in comparing multilingual corpora. But there is no prior work which studied the impact of Wikipedia on MDC. Here, we have made an in-depth study on availing Wikipedia in enhancing MDC p...
Evaluating the Performance of XML Document Clustering by Structure Only
This paper reports the results and experiments performed on the INEX 2006 Document Mining Challenge Corpus with the PCXSS clustering method. The PCXSS method is a progressive clustering method that computes the similarity between a new XML document and existing clusters by considering the structures within documents. We conducted the clustering task on the INEX and Wikipedia data sets.